Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS by Robby955 · Pull Request #796 · openai/parameter-golf

Robby955 · 2026-03-26T01:41:00Z

Record: Empirical Bayes N-gram Mixing -- val_bpb=0.2292

What this does

Instead of hand-tuning alpha multipliers for each n-gram order (my previous submission at 0.2880), I replaced the mixing strategy with Bayesian posterior inference.

The formula:

p(token) = (ngram_count + c * neural_prob) / (total + c)

This is the Dirichlet-Multinomial posterior predictive. The neural model is the prior, n-gram counts are the likelihood, concentration c controls the tradeoff. Applied recursively from bigram up to 15-gram, where each order's smoothed estimate becomes the next order's prior.

A single global concentration (c=5.0) handles the sparse-count problem that previously required hand-tuned per-order multipliers. The improvement is 0.059 BPB, which I didn't expect from replacing 14 tuned parameters with 1.

Results

Seed	BPB	Artifact
1337	0.22922259	14,845,997
2024	0.22923179	14,860,181
2025	0.22922912	14,846,933
Mean	0.22923
Std	0.000005

Ablation chain

Config	BPB	Delta
Neural model only (no cache)	1.1745	--
+ 7-gram backoff + prefill	0.6565	-0.518
+ extend to 15-gram	0.6189	-0.038
+ order-adaptive gating	0.4374	-0.182
+ complementary training (alpha=0.50)	0.3707	-0.067
+ per-order multipliers	0.2880	-0.083
+ Dirichlet smoothing (c=5.0)	0.2292	-0.059

What's novel

Using a neural LM as the base measure in hierarchical Bayesian n-gram smoothing. Traditional Bayesian LMs (MacKay & Peto 1995, Teh 2006) use uniform or unigram priors. This is the Dirichlet special case (discount=0) of the Pitman-Yor family, a sibling to Kneser-Ney, not a generalization of it.

What's borrowed

N-gram cache approach from the community (especially @deanbrr, @lukacf, @Asukabot0, @newjordan). Complementary training from @pentxayc. Per-order multiplier concept from @AayushBaniya2006 (now replaced by Dirichlet). The Bayesian smoothing formula itself is textbook.

Compliance

Constraint	Limit	Actual	Status
Train time	600s	560s	Pass
Eval time	600s	366s (max across seeds)	Pass
Artifact	16,000,000	14,860,181 (max)	Pass
Backward-looking cache	required	yes	Pass
Single-pass eval	required	yes	Pass

Technical details

11L transformer (3 shared x 3 loops + 2 unique, EBLS), 512d, 8 heads / 4 KV heads (GQA), complementary n-gram training (alpha=0.5), 15-order recursive Bayesian backoff with concentration=5.0, int6 GPTQ + LZMA compression. ~14.9 MB artifact.

Feedback welcome.

3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003) 8xH100 SXM, 560s training + ~300s eval, all artifacts under 16MB. Key innovation: distributed cache pre-fill using pure numpy. Each GPU rank pre-populates n-gram hash tables with ALL preceding token positions before scoring, producing results mathematically identical to single-GPU sequential evaluation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ptive gating 3-seed validated (seeds 1337, 2024, 2025, std 0.0003). Up from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB) and order-adaptive entropy gating (-0.18 BPB).

hypery11 · 2026-03-26T06:35:56Z

nice 🔥🔥🔥🔥

@pentxayc

Add complementary training (from @pentxayc #803) and per-order multipliers (from @AayushBaniya2006 #809) on top of distributed prefill + 15-gram + order-adaptive gating. New 3-seed results: 0.28798 / 0.28804 / 0.28810 All seeds under 16MB, training under 560s, eval under 330s. Updated README with legality hedge, full ablation, credits.

CRITICAL FIX: Previously each of 8 GPU ranks only updated its n-gram cache with its own 1/8 of scored windows. Now ALL ranks update with the FULL chunk (same as mixer already does). PR openai#796 showed this costs ~0.31 BPP: "Without pre-fill, ranks 1-7 start with empty n-gram caches. This costs ~0.31 BPP." Expected: massive improvement from 8x more n-gram data per rank. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full-chunk n-gram cache sharing: 0.6913 -> 0.5865 (-0.105 BPB) This confirms PR openai#796's finding that rank-local caches lose ~0.1+ BPB. WARNING: artifact=16.25MB (over 16MB limit for this seed). Need to increase pruning from 3% to 4%, or reduce bigram_vocab_size, to ensure all seeds fit. Eval time: 492s (within budget). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ipliers Novel improvement over uniform entropy threshold: - Per-order entropy center: order 2 → 5.0 (trust only when confused), order max → 2.0 (trust even when model is OK) - Per-order alpha multiplier: order 2 → 0.3× (suppress noise), order max → 2.0× (boost precision) - Linear interpolation between orders for smooth transition Inspired by PR openai#796's ablation showing -0.182 BPP from order-adaptive gating alone. Our implementation is continuous (sigmoid per order) rather than discrete thresholds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… validated) Replace per-order multipliers with recursive Dirichlet posterior predictive. Neural model as informative prior, single concentration c=5.0. 3-seed mean: 0.22923 BPB (std 0.000005). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Robby955 · 2026-03-26T18:43:12Z

Updated submission: 0.6567 → 0.2880 → 0.2292 BPB (3-seed mean, std 0.000005).

Replaced per-order multipliers with Dirichlet-Multinomial posterior smoothing (single concentration c=5.0). All logs, code, and submission.json updated in latest commit.

MatoTeziTanka · 2026-03-26T19:35:45Z

This is one of the cleanest submissions in the competition. Replacing 14 hand-tuned per-order alpha parameters with a single Dirichlet concentration (c=5.0) is elegant — the recursive posterior predictive naturally handles sparsity at high orders without any manual intervention. The math does what entropy thresholds and sigmoid gating are trying to approximate.

The 3-seed std of 0.000005 is also remarkable — tightest we've seen across all submissions.

Nice work.

Robby955 · 2026-03-27T20:56:41Z

Superseded by neural-track work.

notapplica mentioned this pull request Mar 26, 2026

⛳ Parameter Golf Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes #140

Open

MatoTeziTanka mentioned this pull request Mar 26, 2026

PROTEUS+STYX — val_bpb 0.8495 (3-seed mean) — LeakyReLU(0.9)² + 5-gram Eval Cache #769

Closed

10 tasks

Update record: 0.4374 BPB — 15-gram + distributed prefill + order-ada…

d19036d

…ptive gating 3-seed validated (seeds 1337, 2024, 2025, std 0.0003). Up from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB) and order-adaptive entropy gating (-0.18 BPB).

Robby955 changed the title ~~Record: 0.6567 BPB — Prefill Cache + 7-Gram Entropy-Adaptive + EBLS~~ Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS Mar 26, 2026

Idan3011 mentioned this pull request Mar 26, 2026

Record: Phrase Cache + N-gram Backoff + EMA-GPU (val_bpb=0.2722) #810

Closed

Robby955 changed the title ~~Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS~~ Record: 0.2880 BPB — Complementary Training + Per-Order Multipliers + Distributed Prefill + 15-Gram + EBLS Mar 26, 2026

Robby955 changed the title ~~Record: 0.2880 BPB — Complementary Training + Per-Order Multipliers + Distributed Prefill + 15-Gram + EBLS~~ Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS Mar 26, 2026

newjordan mentioned this pull request Mar 26, 2026

Illegal submissions megathread #677

Open

sofiabod mentioned this pull request Mar 26, 2026

Record: Order-Adaptive 9-gram Backoff + Distributed Prefill — val_bpb 0.4405 (3-seed mean) #890

Open

Robby955 mentioned this pull request Mar 26, 2026

Record: Two-Level Dirichlet Posterior Mixing with Per-Order OBCL -- 0.1156 BPB #900

Closed

Idan3011 mentioned this pull request Mar 27, 2026

Normalized N-gram + Bayesian First-Match (val_bpb 0.3922) #972

Closed

Robby955 closed this Mar 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS#796

Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS#796
Robby955 wants to merge 4 commits intoopenai:mainfrom
Robby955:record/prefill-7gram-ebls-0.6567

Robby955 commented Mar 26, 2026 •

edited

Loading

Uh oh!

hypery11 commented Mar 26, 2026

Uh oh!

Robby955 commented Mar 26, 2026

Uh oh!

MatoTeziTanka commented Mar 26, 2026

Uh oh!

Robby955 commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Robby955 commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Record: Empirical Bayes N-gram Mixing -- val_bpb=0.2292

What this does

Results

Ablation chain

What's novel

What's borrowed

Compliance

Technical details

Uh oh!

hypery11 commented Mar 26, 2026

Uh oh!

Robby955 commented Mar 26, 2026

Uh oh!

MatoTeziTanka commented Mar 26, 2026

Uh oh!

Robby955 commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Robby955 commented Mar 26, 2026 •

edited

Loading